The most important variables associated with death due to COVID‐19 disease, based on three data mining models Decision Tree, AdaBoost, and Support Vector Machine: A cross‐sectional study

Abstract
Introduction: Death due to COVID-19 is one of the biggest health challenges in the world. There are many models that can predict death due to COVID-19. This study aimed to fit and compare Decision Tree (DT), Support Vector Machine (SVM), and AdaBoost models for predicting death due to COVID-19.
Methods: To describe the variables, mean (SD) and frequency (%) were reported. To determine the relationship between the variables and death caused by COVID-19, the chi-square test was performed at a significance level of 0.05. The DT, SVM, and AdaBoost models were compared for predicting death due to COVID-19 in terms of sensitivity, specificity, accuracy, and the area under the receiver operating characteristic (ROC) curve, using R software with the psych, caTools, ROSE (Random Over-Sampling Examples), rpart, and rpart.plot packages.
Results: Out of the total of 23,054 patients studied, 10,935 cases (46.5%) were women, and 12,569 cases (53.5%) were men. Additionally, the mean age of the patients was 54.9 ± 21.0 years. Gender, fever, cough, muscle pain, smell and taste, abdominal pain, nausea and vomiting, diarrhea, anorexia, dizziness, chest pain, intubation, cancer, diabetes, chronic blood disease, immune deficiency, pregnancy, dialysis, and chronic lung disease showed a statistically significant relationship with death in COVID-19 patients (p < 0.05). The sensitivity, specificity, accuracy, and area under the ROC curve were, respectively, 0.60, 0.68, 0.71, and 0.75 in the DT model; 0.54, 0.62, 0.63, and 0.71 in the SVM model; and 0.59, 0.65, 0.69, and 0.74 in the AdaBoost model.
Conclusion: The results showed that DT had higher predictive power than the other data mining models. Therefore, researchers in different fields are encouraged to consider DT for predicting the variables they study. Future studies could also use other approaches, such as random forest or XGBoost, to improve accuracy.


KEYWORDS
AdaBoost, COVID-19, data mining models, death, Decision Tree, effective factors, Support Vector Machine

| INTRODUCTION
In December 2019, an outbreak of pneumonia of unknown origin was reported in Wuhan city, Hubei province, China. 1 The pneumonia cases were epidemiologically linked to the Huanan seafood wholesale market. Inoculation of respiratory samples into human airway epithelial cells and the Vero E6 and Huh7 cell lines led to the isolation of a novel respiratory virus. Genome analysis showed it to be a novel SARS-CoV-related coronavirus, and it was therefore named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a betacoronavirus belonging to the Sarbecovirus subgenus. 2 The global spread of SARS-CoV-2 and thousands of deaths from coronavirus disease (COVID-19) led the World Health Organization to declare the disease a pandemic on March 11, 2020. To date, the world has paid a heavy toll in this pandemic (including human toll, economic consequences, and increased poverty). 3,4 Coronaviruses in humans, mammals, and birds have been classified by genotyping and serology into four types (alpha, beta, gamma, and delta), of which the alpha and beta types cause disease in humans. 5 As of January 17, 2020, 62 cases of this coronavirus had been confirmed in China and three cases outside China (two in Thailand and one in Japan). 6 This new virus, with its unknown nature and high prevalence, involved the whole world within a short time. 7 The common symptoms of this disease are fever, cough, shortness of breath, sore throat, myalgia, and rhinorrhea, and the important factors associated with it are age, cardiovascular diseases, and chronic lung diseases. 8 The results of a systematic review showed that the elderly, men, Black people, obese people, smokers, and people with diabetes, cardiovascular disease, kidney disease, or hypertension are at greater risk of hospitalization due to COVID-19 infection, which relevant experts can take into account. 9
The remarkable thing about patients in the acute phase of this disease is that they suffer from severe complications that may affect their lives for years. 10 The coronavirus causes various diseases, including respiratory, intestinal, kidney, heart, and nervous system diseases. 11 A study has also determined that the social, personal, economic, and health effects of this disease may remain in societies for many years. 12,13 Studies have shown that data mining models, including the Decision Tree (DT), can be useful for modeling and predicting people at risk of COVID-19. 5 Many studies have been conducted to identify the most important variables related to various diseases, in which data mining models have been used and the correctness of these models has been confirmed. 14,15 Studies have also investigated the most important factors related to death in patients with COVID-19 using data mining algorithms. The effect or importance of factors related to death in these patients differs across countries; therefore, the role of environmental, geographical, racial, cultural, and nutritional factors in patient mortality in different places can be evaluated. 16
The DT has always been of interest because it is simple to fit, runs quickly, and requires neither a very large sample size nor specific assumptions. The Support Vector Machine (SVM) algorithm often performs better than the DT, and, with a kernel function, it performs relatively well when the data are high-dimensional or the classes are not cleanly separable. However, the performance of the SVM depends on the type of kernel function and on the presence of noisy data, and its requirements for time and sample size are relatively high. On the other hand, the Adaptive Boosting (AdaBoost) algorithm is proposed as a classifier because it can learn nonlinear relationships, is fast to fit, can combine multiple base classifiers (such as DT, random forest, and SVM), and, through the ensemble technique of boosting, is effective in reducing bias. In a study conducted with the aim of comparing data mining models in diagnosing factors related to diseases, the DT performed better than the other models. 15 In another study on the prognosis of cervical cancer, the AdaBoost algorithm was more accurate than the genetic algorithm in prognosis classification. 17,18 [20][21] It should be noted that there is no absolute "best" among different algorithms and statistical models, and the accuracy of different models in data classification also depends on the nature of the data. Therefore, the model should be selected according to the data at hand and the desired performance. Because different methods exist for classifying data, the models should be compared to find the optimal one. To the best of our knowledge, no study has yet been undertaken to concurrently identify the most significant variables associated with COVID-19-related mortality in Kermanshah, Western Iran, using the three data mining models (DT, SVM, and AdaBoost).
Given the importance of identifying the most important factors related to death in COVID-19 patients, so that appropriate measures can be taken to reduce mortality in these patients, this study compared these three data mining models, determined their sensitivity and specificity, and identified the most important factors related to death from COVID-19 in the city of Kermanshah, Iran.

| Type and scope of study
The current cross-sectional study was conducted in Kermanshah hospitals from February 20, 2020, to February 19, 2021. Kermanshah is one of the western provinces of Iran and has a population of about two million people (61.7% urban residents, 37.7% rural residents, and the rest nonresidents). Kermanshah city is the capital of the province, and the dominant ethnicity is Kurdish. 22

| Sample and data collection methods
The samples were selected by the census method. The sample included outpatients and inpatients in Imam Reza, Farabi, Golestan, and Shohada hospitals in Kermanshah, Iran. In the initial review, the records of 50,000 people were registered in the hospital information system (HIS). However, some of these records were duplicates, and others were excluded on the basis of their polymerase chain reaction (PCR) results (PCR was the most reliable test for COVID-19 at the time 23). After checking and cleaning, a sample size of 23,054 people was reached.

| Data mining models
In this study, three data mining models were used: Support Vector Machine (SVM), DT, and AdaBoost. These models are among the most powerful and widely used data mining algorithms for prediction. 18,24 Among the reasons for their popularity are clarity, comprehensibility, flexibility, and relatively fast processing. The models were fitted on a training set and evaluated on a separate test set: 70% of the data were allocated to training and 30% to testing.
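The 70%/30% division described above can be illustrated with a minimal sketch. The study itself worked in R; the Python version below uses only the standard library, and the function name `holdout_split`, the seed value, and the toy list of patient IDs are assumptions made for illustration, not part of the original analysis.

```python
import random


def holdout_split(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of `records` and split it into (train, test) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(records)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy data: 10 patient IDs standing in for the real records.
patients = list(range(10))
train, test = holdout_split(patients)
print(len(train), len(test))
```

A fixed seed keeps the split reproducible, so all three models can be trained and tested on exactly the same partitions.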

| Statistical analysis
To describe the studied variables, frequencies (percentages) were reported. The chi-square test was used to determine the relationship between the independent variables and patients' status (discharged, died). A significance level of 0.05 was considered. Then, accuracy, sensitivity, specificity, and the area under the receiver operating characteristic (ROC) curve were used to compare the predictive power of the three data mining models: SVM, DT, and AdaBoost. Data analysis was performed in R version 3.5.1. The models were fitted and evaluated using the psych, caTools, ROSE (Random Over-Sampling Examples), rpart, rpart.plot, and adaboost packages.
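As a rough illustration of the chi-square test at the 0.05 level, the sketch below computes Pearson's statistic and its p value for a 2×2 table (for example, gender by patient status). It is written in Python with the standard library only, whereas the study used R; the function name `chi_square_2x2` and the counts are made up for illustration, and no continuity correction is applied (R's `chisq.test` applies Yates' correction to 2×2 tables by default, so its output would differ slightly).

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    `table` is [[a, b], [c, d]]; returns (statistic, p_value) with
    1 degree of freedom and no continuity correction.
    """
    (a, b), (c, d) = table
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    n = a + b + c + d
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                raise ValueError("every row and column must have a nonzero total")
            statistic += (observed - expected) ** 2 / expected
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Made-up counts for illustration only (rows: men, women; columns: died, discharged).
toy_table = [[30, 70], [20, 80]]
stat, p = chi_square_2x2(toy_table)
print(f"chi-square = {stat:.3f}, p = {p:.3f}, significant at 0.05: {p < 0.05}")
```

The closed-form p value holds only for one degree of freedom; variables with more than two categories would need the general chi-square distribution.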
Of these, 3,298 cases required admission to the intensive care unit. Patients had four methods of hospital access available: personal referral, private ambulance, medical center ambulance, or calling the EMS center at 115. Specifically, 148 individuals (0.6%) used private ambulances, 513 people (2.2%) used the 115 center ambulances, 1,333 individuals (7.6%) relied on health center ambulances, and 21,060 people (89.6%) were referred to the health center by personal means. Comparing the two groups, one with the outcome event and the other without, revealed a significantly higher mortality rate among men (57.5%) than among women (42.5%) (p = 0.001). Additionally, patients residing in Kermanshah city (97.7%) experienced a higher incidence of mortality than those in other cities/counties (2.3%) (Table 1).
Among the three reviewed algorithms, when implemented on the data set, AdaBoost and DT presented a more reasonable feature selection. In the AdaBoost model, the variables of age, chronic blood disease, fever, cough, muscle pain, immune system deficiency, low level of consciousness, diarrhea, and vomiting were selected in order of importance. In the DT model, the variables of age, fever, cough, muscle pain, sense of taste, abdominal pain, vomiting, diarrhea, chronic blood disease, chronic lung disease, and neurological disorders were selected in order of importance.
The models used in this study were compared in terms of accuracy, sensitivity, specificity, and area under the ROC curve (Table 3).
According to the results in Table 3, the DT model, with an accuracy of 0.75, had the highest accuracy among the three models. The DT model also had the highest sensitivity, at 0.60. For the two indices of specificity and area under the ROC curve, the DT likewise obtained the highest values, 0.68 and 0.71, respectively. After the DT model, the AdaBoost model ranked second, with accuracy, sensitivity, specificity, and area under the ROC curve of 0.74, 0.58, 0.65, and 0.68, respectively. The SVM model ranked last, with accuracy, sensitivity, specificity, and area under the ROC curve of 0.70, 0.54, 0.62, and 0.63, respectively, which shows poor performance compared to the other two models.
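For readers unfamiliar with the four indices compared above, the following sketch shows how they are commonly computed from true labels, predicted labels, and predicted scores. It is a Python illustration using only the standard library, not the R code used in the study; the function names, the 0.5 cut-off, and the labels and scores are assumptions chosen for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy for binary labels (1 = died, 0 = discharged)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # deaths predicted as deaths
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # discharges predicted as discharges
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise ranking.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count as 0.5).
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted death probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

print(classification_metrics(y_true, y_pred))
print(roc_auc(y_true, scores))
```

Sensitivity, specificity, and accuracy depend on the chosen cut-off, whereas the area under the ROC curve summarizes performance across all cut-offs, which is why the two kinds of index can rank models differently.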

| DISCUSSION
Death due to COVID-19 is one of the biggest health challenges in the world. Predicting the death of COVID-19 patients can help in allocating human and treatment resources. There are many models that can predict death due to COVID-19. [26][27] However, these three models have not previously been compared at the same time, which is a strength of the present study. The results of the present study showed that the accuracy of the DT model was higher than that of SVM and AdaBoost. The studies of Parsapour et al. 28 on predicting the severity of depressive disorder, NateghiNia et al. 29 on predicting the condition of patients staying in the intensive care unit, Nazeri et al. 30 on predicting breast cancer metastasis, and Heidari et al. 31 on diagnosing infertility showed that the accuracy and predictive power of DT were higher than those of K-means, neural network, SVM, and K-nearest neighbor. The results of these studies were consistent with those of the present study.
TABLE 1 Demographic characteristics of COVID-19 patients based on vital status.
Note: p: p value from chi-square test; percentages were calculated on the basis of participants with available data. Abbreviation: AIDS, Acquired Immune Deficiency Syndrome.
TABLE 3 Comparison of models by evaluation indices.